Run-Length Compressed Indexes for Repetitive Sequence Collections

Authors

  • Veli Mäkinen
  • Gonzalo Navarro
  • Jouni Sirén
  • Niko Välimäki
Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to NH_k + o(N log σ) bits, at the cost of the running times of its operations increasing by a polylog(N) factor. Here H_k is the k-th order entropy of the collection and σ is the alphabet size. Notice that for r identical copies of an incompressible base sequence, the bound simplifies to N log σ (1 + o(1)) bits. We develop new static/dynamic full-text self-indexes based on run-length encoding whose space requirements are much less dependent on N. For example, we obtain an index occupying R log σ (1 + o(1)) + R log(N/R) (1 + o(1)) + r log n + O((s + r) log(s + r)) bits, where s is the total number of basic edit operations to convert the r repeats into substrings of the base sequence, and R ≤ min(n, nH_k) + O((s + r) log_σ N), where the O() term holds in the expected case. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree using an additional O((N/δ) log N) bits of space for any δ = polylog(N), and retaining the polylog(N) time slowdown on operations. Computing Reviews (1998)
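
The claim that space can depend on R rather than on N can be illustrated with a toy experiment. The Python sketch below is not the authors' index. It assumes that R counts the equal-symbol runs in the Burrows-Wheeler transform of the collection (the abstract does not define R), it models edit operations as single-symbol substitutions only, and it concatenates the copies without separators. All function names and parameters are hypothetical and chosen for illustration.

```python
import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via naive suffix sorting (small inputs only)."""
    assert "\0" not in text
    t = text + "\0"  # unique terminator, smaller than every other symbol
    suffix_order = sorted(range(len(t)), key=lambda i: t[i:])
    # The symbol preceding each suffix; t[-1] is the terminator for suffix 0.
    return "".join(t[i - 1] for i in suffix_order)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def mutate(base: str, edits: int, rng: random.Random, alphabet: str = "ACGT") -> str:
    """Apply `edits` random single-symbol substitutions to a copy of base."""
    symbols = list(base)
    for _ in range(edits):
        pos = rng.randrange(len(symbols))
        choices = [c for c in alphabet if c != symbols[pos]]
        symbols[pos] = rng.choice(choices)
    return "".join(symbols)


def build_collection(n: int, r: int, edits_per_copy: int, seed: int = 0):
    """Return (base, collection): r mutated copies of a random base of length n."""
    rng = random.Random(seed)
    base = "".join(rng.choice("ACGT") for _ in range(n))
    copies = [mutate(base, edits_per_copy, rng) for _ in range(r)]
    return base, "".join(copies)


if __name__ == "__main__":
    base, collection = build_collection(n=200, r=20, edits_per_copy=1)
    print("n =", len(base), "N =", len(collection))
    print("runs in BWT(base)       =", count_runs(bwt(base)))
    print("runs in BWT(collection) =", count_runs(bwt(collection)))
```

Under these assumptions, the run count of the collection should stay close to that of the base sequence plus a small amount per edit, even though N is r times larger than n. The naive suffix sort is quadratic in memory and is only suitable for inputs of a few thousand symbols.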

Similar resources

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N . Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive se...

Document Listing on Repetitive Collections

Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our addit...

Indexing Highly Repetitive Collections

The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.

Universal Indexes for Highly Repetitive Document Collections

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We int...

Storage and Retrieval of Highly Repetitive Sequence Collections

A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example is the genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occ...

Publication date: 2008